Student Performance
Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization
A hallmark property of explainable AI models is the ability to teach other agents, communicating knowledge of how to perform a task. While Large Language Models (LLMs) perform complex reasoning by generating explanations for their predictions, it is unclear whether they also make good teachers for weaker agents. To address this, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) whether the teacher's test-time intervention improves student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) whether teacher explanations also improve student performance on future unexplained data.
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Zbeeb, Mohammad, Hammoud, Hasan Abed Al Kader, Mukalled, Sina, Rizk, Nadine, Karnib, Fatima, Lakkis, Issam, Mohanna, Ammar, Ghanem, Bernard
The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Middle East > Lebanon > Beirut Governorate > Beirut (0.05)
- (10 more...)
- Research Report (0.51)
- Questionnaire & Opinion Survey (0.34)
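The budgeted teacher-student setup described above can be sketched in a few lines: the teacher explains only a fixed fraction of the data, prioritizing points by an estimated utility. This is an illustrative sketch, not the paper's implementation; `student_predict`, `teacher_explain`, and `utility` are hypothetical stand-ins.

```python
def teach_with_budget(data, student_predict, teacher_explain,
                      utility, budget_fraction=0.2):
    """Return student predictions, intervening only within the budget."""
    # Rank data points by how much a teacher explanation is expected to help.
    ranked = sorted(range(len(data)), key=lambda i: utility(data[i]), reverse=True)
    budget = int(len(data) * budget_fraction)
    explained = set(ranked[:budget])

    predictions = []
    for i, x in enumerate(data):
        if i in explained:
            # Within budget: the teacher communicates an explanation,
            # and the student conditions its prediction on it.
            predictions.append(student_predict(x, explanation=teacher_explain(x)))
        else:
            predictions.append(student_predict(x, explanation=None))
    return predictions
```

The key design point is that ranking by utility decides *when* to explain (axis 2), independently of *how* the explanation itself is produced.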
Automatic Essay Scoring and Feedback Generation in Basque Language Learning
Azurmendi, Ekhi, Arregi, Xabier, de Lacalle, Oier Lopez
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Spain > Basque Country (0.04)
- Europe > Faroe Islands > Streymoy > Tórshavn (0.04)
- Education > Assessment & Standards > Student Performance (0.73)
- Education > Curriculum > Subject-Specific Education (0.50)
Uncovering Students' Inquiry Patterns in GenAI-Supported Clinical Practice: An Integration of Epistemic Network Analysis and Sequential Pattern Mining
Wei, Jiameng, Dang, Dinh, Yang, Kaixun, Stokes, Emily, Mazeh, Amna, Lim, Angelina, Dai, David Wei, Moore, Joel, Fan, Yizhou, Gasevic, Danijela, Gasevic, Dragan, Chen, Guanliang
Assessment of medication history-taking has traditionally relied on human observation, limiting scalability and detailed performance data. While Generative AI (GenAI) platforms enable extensive data collection and learning analytics provide powerful methods for analyzing educational traces, these approaches remain largely underexplored in pharmacy clinical training. This study addresses this gap by applying learning analytics to understand how students develop clinical communication competencies with GenAI-powered virtual patients -- a crucial endeavor given the diversity of student cohorts, varying language backgrounds, and the limited opportunities for individualized feedback in traditional training settings. We analyzed 323 students' interaction logs across Australian and Malaysian institutions, comprising 50,871 coded utterances from 1,487 student-GenAI dialogues. Combining Epistemic Network Analysis to model inquiry co-occurrences with Sequential Pattern Mining to capture temporal sequences, we found that high performers demonstrated strategic deployment of information recognition behaviors. Specifically, high performers centered inquiry on recognizing clinically relevant information, integrating rapport-building and structural organization, while low performers remained in routine question-verification loops. Demographic factors including first-language background, prior pharmacy work experience, and institutional context, also shaped distinct inquiry patterns. These findings reveal inquiry patterns that may indicate clinical reasoning development in GenAI-assisted contexts, providing methodological insights for health professions education assessment and informing adaptive GenAI system design that supports diverse learning pathways.
- Oceania > Australia (0.06)
- South America > Uruguay > Maldonado > Maldonado (0.04)
- Asia > Malaysia (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Instructional Material (1.00)
- Health & Medicine > Health Care Providers & Services (0.91)
- Education > Assessment & Standards > Student Performance (0.34)
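The sequential-pattern side of the analysis above — finding inquiry-code sequences that recur across dialogues — can be illustrated with a minimal frequent-bigram miner. This is a simplified sketch of the general technique, not the study's actual method; sequence support (the number of dialogues containing a pattern) is a common convention in sequential pattern mining.

```python
from collections import Counter

def frequent_bigrams(sequences, min_support=2):
    """Find contiguous code bigrams occurring in at least `min_support` dialogues."""
    support = Counter()
    for seq in sequences:
        # Count each distinct adjacent pair at most once per dialogue,
        # so support reflects how many dialogues contain the pattern.
        support.update(set(zip(seq, seq[1:])))
    return {pair: n for pair, n in support.items() if n >= min_support}
```

Applied to coded utterances such as "recognize", "verify", or "rapport", patterns like a persistent question-verification loop would surface as high-support bigrams.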
Generating Reading Comprehension Exercises with Large Language Models for Educational Applications
Huang, Xingyu, Jiang, Fei, Xiao, Jianli
With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM framework, named Reading Comprehension Exercise Generation (RCEG), that automatically generates high-quality, personalized English reading comprehension exercises. First, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate, substantially improving the quality of the generated content. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
- Asia > Singapore (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
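The generate-then-select pipeline in the RCEG abstract is a standard over-generation pattern and can be sketched generically. `generator` and `discriminator` here are illustrative stand-ins for the paper's fine-tuned LLM and quality discriminator, not its actual API.

```python
def generate_exercise(prompt, generator, discriminator, n_candidates=4):
    """Over-generate exercise candidates, then keep the highest-scoring one.

    generator: callable mapping a prompt to one candidate exercise.
    discriminator: callable scoring a candidate (higher is better).
    """
    candidates = [generator(prompt) for _ in range(n_candidates)]
    return max(candidates, key=discriminator)
```

The design trade-off is simple: sampling more candidates raises quality at the cost of `n_candidates` generation calls per exercise.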
Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
Hua, Haowei, Jiao, Hong, Wang, Xinyi
The majority of summarized essays fall well below the 512-token limit (marked by the red dashed line), indicating that the summarization process effectively compressed the original texts while maintaining consistency in length. The smooth decline beyond 300 tokens and the sparse occurrence of samples approaching the upper limit suggest that very few summaries exceeded the intended compression threshold. Overall, this distribution demonstrates that the GPT-5-mini summarizer produced concise and length-stable representations, ensuring efficient model input handling and minimizing the risk of truncation in downstream processing.
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > New York (0.04)
- (4 more...)
- Education > Assessment & Standards > Student Performance (0.73)
- Food & Agriculture > Agriculture (0.68)
- Information Technology > Security & Privacy (0.46)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.31)
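A length audit like the one described above is easy to reproduce. The sketch below uses whitespace tokenization as a crude stand-in for the model's real tokenizer, and the function name is illustrative.

```python
def length_stats(summaries, limit=512, tokenize=str.split):
    """Report token-length behavior of summaries relative to a model input limit."""
    lengths = [len(tokenize(s)) for s in summaries]
    return {
        "max": max(lengths),                              # longest summary seen
        "mean": sum(lengths) / len(lengths),              # average length
        "over_limit": sum(1 for n in lengths if n > limit),  # truncation risk count
    }
```

In practice one would swap in the downstream model's tokenizer for `tokenize`, since whitespace counts only approximate subword token counts.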
IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection
Miandoab, Kaveh Eskandari, Kowalyshyn, Katharine, Pamnani, Kabir, Gavhera, Anesu, Sarathy, Vasanth, Scheutz, Matthias
IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text.
- Europe > Austria > Vienna (0.15)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.05)
- Europe > Switzerland (0.05)
- Education > Assessment & Standards > Student Performance (0.57)
- Education > Educational Technology > Educational Software (0.56)
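The graph structure IntelliProof describes — claims as nodes, evidence as node properties, edges carrying support/attack relations with LLM-assigned scores — maps naturally onto a small data model. This is a minimal sketch with illustrative names, not the system's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence: list = field(default_factory=list)  # evidence stored as a node property

@dataclass
class ArgumentGraph:
    nodes: dict = field(default_factory=dict)   # claim id -> Claim
    edges: list = field(default_factory=list)   # (src, dst, relation, score)

    def add_claim(self, cid, text, evidence=()):
        self.nodes[cid] = Claim(text, list(evidence))

    def relate(self, src, dst, relation, score):
        # `relation` is "support" or "attack"; `score` is the LLM's confidence,
        # kept explicit so a human reviewer can inspect or override it.
        self.edges.append((src, dst, relation, score))

    def supporters(self, cid):
        return [s for s, d, rel, _ in self.edges if d == cid and rel == "support"]
```

Keeping scores on edges rather than folding them into the graph topology is what lets the system both visualize each relation and compute quantitative coherence measures over the same structure.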
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Byun, Grace, Rajwal, Swati, Choi, Jinho D.
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.69)
- Education > Assessment & Standards > Student Performance (0.68)
- Education > Educational Setting > Higher Education (0.67)
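The two agreement measures the grading study reports — Pearson correlation and exact score agreement — can be computed with plain Python. This is an illustrative re-implementation of the standard measures, not the study's released code.

```python
def grading_agreement(llm_scores, human_scores):
    """Pearson correlation and exact-match rate between LLM and human grades."""
    n = len(llm_scores)
    mx = sum(llm_scores) / n
    my = sum(human_scores) / n
    # Pearson r: covariance normalized by the two standard deviations.
    cov = sum((x - mx) * (y - my) for x, y in zip(llm_scores, human_scores))
    sx = sum((x - mx) ** 2 for x in llm_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in human_scores) ** 0.5
    pearson = cov / (sx * sy)
    # Exact agreement: fraction of items scored identically.
    exact = sum(x == y for x, y in zip(llm_scores, human_scores)) / n
    return pearson, exact
```

Reporting both matters: correlation can stay high even when the LLM is systematically off by a point, which only the exact-agreement rate exposes.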